Introduction to Statistics

Bennett Kleinberg

Week 4

Week 4

  • Part 1: distributions and samples
  • Part 2: intro to hypothesis testing
  • Part 3: errors in inference

PART 1: Distributions and samples

Samples again!

The underlying idea

  • there is a population
  • but we never have access to it
  • theoretical argument

So sampling is our only recourse.

Our data

  • suppose we have access to the whole population
  • i.e. we know the attribute value of every person
  • here: the height of every adult in the Netherlands (say: 10 million people)

Population parameters (in cm):

\(\mu = 175\) and \(\sigma = 7\)

Population data

Our full population of 10 million adults:

id        height
1         174
2         198
…         …
9999999   156
10000000  180

Histogram

Drawing samples

We are the researchers now:

We do not know all this about the population!!!

We want to know how tall Dutch adults are.

  • we cannot ask every adult in the Netherlands
  • so we need to sample

Let’s start…

Sample of 3

\(n=3\)

id height
693610 158.61
8177752 181.76
9426218 172.30

\(M = \frac{\sum{X}}{n} = \frac{158.61+181.76+172.30}{3} = \frac{512.67}{3} = 170.89\)

\(SD = \sqrt{\frac{SS}{n-1}}\)

\(SS =\sum{(X-M)^2} = (158.61-170.89)^2+... = 270.94\)

\(SD = \sqrt{\frac{270.94}{3-1}} =\sqrt{135.47} =11.64\)

Sampling error

When we obtain statistics from our data, we talk about:

  • population parameters
  • sample statistics

The sampling error is the difference between the two.

Here:

  • \(\mu = 175\) but \(M=170.89\)
  • \(\sigma = 7\) but \(SD=11.64\)
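The sample statistics above can be reproduced in a few lines (a minimal sketch in Python; the three height values are those from the sample table):

```python
import math

# the n = 3 sample of heights (in cm) drawn above
heights = [158.61, 181.76, 172.30]
n = len(heights)

# sample mean: M = sum(X) / n
M = sum(heights) / n

# sum of squared deviations from the sample mean: SS = sum((X - M)^2)
SS = sum((x - M) ** 2 for x in heights)

# sample standard deviation: SD = sqrt(SS / (n - 1))
SD = math.sqrt(SS / (n - 1))

print(round(M, 2))   # sample mean
print(round(SD, 2))  # sample standard deviation
```

Note that the deviations in SS are taken from the sample mean \(M\), not from \(\mu\): as researchers we would not know \(\mu\).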

What if we sampled again?

We repeat the sampling process: now we take a sample of \(n=3\) twice.

sample n mean_height
1 3 170.89
2 3 175.21

Even more…

Repeated sampling: 10 times a sample of \(n=3\)

sample n mean_height
1 3 170.89
2 3 175.21
3 3 175.40
4 3 172.42
5 3 177.65
6 3 180.61
7 3 169.51
8 3 179.00
9 3 179.69
10 3 174.82

Distribution of means

We repeat it 1,000 times
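This repeated-sampling experiment is easy to simulate (a sketch in Python, assuming a normal population with \(\mu=175\) and \(\sigma=7\) as above):

```python
import random
import statistics

random.seed(1)  # for reproducibility

MU, SIGMA = 175, 7      # population parameters
N_SAMPLES, N = 1000, 3  # 1,000 samples, each of size n = 3

# draw each sample and record only its mean
sample_means = [
    statistics.mean(random.gauss(MU, SIGMA) for _ in range(N))
    for _ in range(N_SAMPLES)
]

# the means pile up around mu; their spread is roughly sigma / sqrt(n)
print(round(statistics.mean(sample_means), 1))
print(round(statistics.stdev(sample_means), 1))
```

A histogram of `sample_means` gives exactly the distribution of means shown on this slide.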

Even better…

Why don’t we also increase sample size \(n\)?

sample n mean_height
1 20 175.17
2 20 174.15
3 20 178.14
4 20 173.99
5 20 173.45
6 20 177.54
7 20 175.86
8 20 172.75
9 20 176.58
10 20 175.04

Mean of the means

We now have sampled 10 times with \(n=20\).

The mean of the 10 means is:

## [1] 175.27

Think back to what we did

We have increased both the number of samples and the sample size to reduce the sampling error.

Something strange happened

Many samples, small n

Large samples, massive n

Now what?

Remember: we want to estimate the population mean \(\mu\) from the sample mean \(M\) (we do not ever have access to the population).

Do we thus need to take many, many samples with big sample sizes?

Luckily, there is a mathematical theorem to our rescue!

The Central Limit Theorem (CLT)

The central limit theorem states that:

  • for a population with mean \(\mu\) and standard deviation \(\sigma\),
  • the distribution of the sample means (each with sample size \(n\)), has:
    • a mean of \(\mu\)
    • and a standard deviation of \(\frac{\sigma}{\sqrt{n}}\)

And: it will approach the normal distribution with increasing \(n\)
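The remarkable part is that the population itself need not be normal. A quick check of the theorem (a sketch in Python, using an exponential population, which is heavily skewed and has \(\mu = \sigma = 1\)):

```python
import random
import statistics

random.seed(42)  # for reproducibility

N, N_SAMPLES = 50, 2000  # sample size and number of samples

# exponential population with rate 1: mu = 1, sigma = 1 (very skewed!)
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(N))
    for _ in range(N_SAMPLES)
]

# CLT: the means centre on mu = 1 with spread sigma / sqrt(n) = 1 / sqrt(50)
print(round(statistics.mean(sample_means), 2))   # close to 1
print(round(statistics.stdev(sample_means), 2))  # close to 0.14
```

Despite the skewed population, a histogram of `sample_means` looks approximately normal, just as the CLT predicts.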

Rewind…

  • We know we can deal with the normal distribution (z-scores, probabilities)
  • The CLT now states that with increasing sample size \(n\), the distribution of sample means becomes normal
  • regardless of the population’s distribution shape!

This is like a life saver.

Characteristics of the sample mean distr.

Shape:

The distribution of sample means approaches the normal distribution if:

  • the population is a normal distribution, or
  • \(n\) becomes large (rule of thumb: \(n > 30\))

Characteristics of the sample mean distr.

Central tendency (mean):

  • if we had all possible sample combinations: the mean of the sample means is by definition equal to \(\mu\)
  • remember: this is why \(M\) is an unbiased statistic

But we do not always have all possible samples (actually: never!).

So we know that \(\mu \approx M\). Thus we need some kind of “variability indicator” for the mean (of the sample means)…

Characteristics of the sample mean distr.

Variability of the mean: the standard error of the mean

  • Same idea as the standard deviation
  • In fact: it is the standard deviation of the distribution of sample means

\(SE = \sigma_M = \frac{\sigma}{\sqrt{n}}\)

This can also be written as: \(SE = \sqrt{\frac{\sigma^2}{n}}\)

Example

We take a sample of \(n=1\) from our height data and get:

## [1] 170.77

The standard error here is \(SE = \frac{\sigma}{\sqrt{n}} = \frac{7}{\sqrt{1}} = 7\)

With \(n=1\), \(SE = \sigma\).

Increasing \(n\)

Remember, our population had \(\mu=175\) and \(\sigma=7\).

n SE
1 7.00
2 4.95
3 4.04
4 3.50
5 3.13
10 2.21
100 0.70
1000 0.22
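The table can be generated directly from the formula (a sketch in Python; `standard_error` is just a name for illustration):

```python
import math

SIGMA = 7  # population standard deviation (cm)

def standard_error(sigma, n):
    """Standard error of the mean: SE = sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# reproduce the SE table above
for n in [1, 2, 3, 4, 5, 10, 100, 1000]:
    print(n, round(standard_error(SIGMA, n), 2))
```

Note the diminishing returns: going from \(n=100\) to \(n=1000\) shrinks the SE far less than going from \(n=1\) to \(n=10\).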

Visually

Putting it all together

If we know:

  • the population mean \(\mu\)
  • the population standard deviation \(\sigma\)
  • the size of the sample \(n\)

Then we can use the CLT to find the shape, mean and standard deviation (standard error) of the distribution of sample means!

Example

Our height data with \(\mu=175\) and \(\sigma=7\).

We take a sample of \(n=60\).

So, the distribution of sample means has:

  • a mean of \(\mu=175\)
  • a standard error of \(SE = \frac{\sigma}{\sqrt{n}} = \frac{7}{\sqrt{60}} = 0.90\)
  • the shape of a normal distribution

Our sample means distribution

Now comes the magic

z-scores again

Given our height data with \(\mu=175\) and \(\sigma=7\):

We take a sample of \(n=100\). What is the probability that the mean height of that sample is 177cm or higher?

Information about the distribution of sample means step-wise:

  1. mean of \(\mu=175\)
  2. standard error of \(SE = \frac{\sigma}{\sqrt{n}} = \frac{7}{\sqrt{100}} = 0.70\)
  3. the shape of a normal distribution

Hypothetical distr.

z-score logic

Obtain z-score:

  • \(z=\frac{M-\mu}{\sigma_M} = \frac{177-175}{0.70} = \frac{2}{0.70} = 2.86\)

Locate area of interest:

  • We are looking at a “X or higher” problem and we are above the mean:
  • so we need the tail proportions for \(z=2.86\)

z-score logic

Translate proportions to probabilities:

  • tail prop. for \(z=2.86\) → .0021
  • \(p=.0021\)

The probability of the sample of \(n=100\) having a mean of 177 or higher is 0.0021 (0.21%)
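Instead of looking up a z-table, the tail proportion can be computed with the standard normal CDF (a sketch in Python using the standard library’s `statistics.NormalDist`):

```python
from statistics import NormalDist

MU, SIGMA, N = 175, 7, 100
M = 177  # observed sample mean

se = SIGMA / N ** 0.5        # standard error: 7 / sqrt(100) = 0.70
z = (M - MU) / se            # z-score for the sample mean
p = 1 - NormalDist().cdf(z)  # upper-tail probability ("177 or higher")

print(round(z, 2))  # 2.86
print(round(p, 4))  # close to the z-table value of .0021
```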

Important

We can calculate two kinds of z-scores:

  1. z-scores for single scores, where \(z=\frac{X-\mu}{\sigma}\)
  2. z-scores for sample means, where \(z=\frac{M-\mu}{\sigma_M}\)

Note: in hypothesis testing, we are mostly interested in sample mean comparisons!
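The distinction matters: the same raw difference of 2 cm is unremarkable for a single person but striking for a sample mean (a sketch in Python, reusing the height example with \(\mu=175\), \(\sigma=7\), \(n=100\)):

```python
MU, SIGMA, N = 175, 7, 100

# 1. z-score for a single score X: z = (X - mu) / sigma
X = 177
z_single = (X - MU) / SIGMA

# 2. z-score for a sample mean M: z = (M - mu) / (sigma / sqrt(n))
M = 177
z_mean = (M - MU) / (SIGMA / N ** 0.5)

print(round(z_single, 2))  # 0.29: one person of 177 cm is unremarkable
print(round(z_mean, 2))    # 2.86: a sample *mean* of 177 cm is very unusual
```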

Law of large numbers

Sampling error vs standard error

Sampling Error = population parameters - sample statistics:

  1. We know that if \(n\) becomes large, \(SE\) becomes smaller.
  2. If \(SE\) becomes smaller, then \(M\) gets closer and closer to \(\mu\)

In other words: with increasing \(n\), we decrease the standard error \(SE\), and thereby reduce the sampling error!
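This shrinking of the error with \(n\) can be seen in a quick simulation (a sketch in Python; the exact errors depend on the random draws):

```python
import random
import statistics

random.seed(7)  # for reproducibility

MU, SIGMA = 175, 7  # population parameters

# draw one sample per sample size and compare its mean to mu
for n in [10, 100, 10000]:
    sample = [random.gauss(MU, SIGMA) for _ in range(n)]
    M = statistics.mean(sample)
    print(n, round(abs(M - MU), 3))  # the error tends to shrink as n grows
```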

PART 2: Hypothesis testing

Core idea

We want to test a hypothesis about a population.

Because we cannot ever have access to the whole population, we need to work with a sample.

i.e. we are interested in making an inference (we are now entering inferential statistics territory) about a population from a sample

Take this example

Extra lessons for the intro to statistics exam.

Let’s walk through this step-by-step.

The scenario

Suppose the “intro to stats” exam grade \(X\) follows this distribution: \(X \sim N(6.9, 1.1)\).

You are now testing whether extra lessons have an effect on the exam grade.

So you are formulating a hypothesis as follows:

  • Conceptually: if a student took extra lessons, the exam grade was higher than without extra lessons.
  • Statistical hypotheses:
    • null hypothesis: there is no effect of having extra lessons
    • alternative hypothesis: extra lessons increase the grade

Two kinds of statistical hypotheses

The null hypothesis:

  • states that there is no effect (“null”)
  • here: the mean of the sample of students who had extra lessons is the same as the mean of the population
  • notation: \(H_0: \mu = 6.9\)

The alternative hypothesis:

  • the effect hypothesis: states that there is an effect (the alternative to null)
  • here: the mean of the sample of students who had extra lessons is higher than the mean of the population
  • notation: \(H_1: \mu > 6.9\) (also common: \(H_A: \mu > 6.9\))

The population

Back to our question

You are now testing whether extra lessons have an effect on the exam grade.

  • \(H_0: \mu = 6.9\)
  • \(H_1: \mu > 6.9\)

You have access to a sample of \(n=49\) students who took extra lessons.

The logic of NHST (1)

NHST = null hypothesis significance testing

  1. we know that under \(H_0\) we expect a sample mean of \(M=6.9\)
  2. it doesn’t have to be exactly \(M=6.9\): if the null hypothesis were supported, we would find a sample mean close to that value
  3. we can use the CLT to draw up the distribution of the sample means under \(H_0\)

We know that the distr. of sample means under \(H_0\) with \(n=49\) has a mean of \(\mu=6.9\) and a standard error of \(\sigma_M = \frac{\sigma}{\sqrt{n}} = \frac{1.10}{\sqrt{49}} = \frac{1.10}{7} = 0.16\)

Sample means under the null

The logic of NHST (2)

NHST = null hypothesis significance testing

  1. we know that under \(H_0\) we expect a sample mean of \(M=6.9\)
  2. it doesn’t have to be exactly \(M=6.9\): if the null hypothesis were supported, we would find a sample mean close to that value
  3. we can use the CLT to draw up the distribution of the sample means under \(H_0\)
  4. if the observed sample mean (from our \(n=49\) sample with extra lessons) is very unlikely under this null distribution, we reject the null

Important

If the observed sample mean (from our \(n=49\) sample with extra lessons) is very unlikely under the null distribution, we reject the null hypothesis.

This is why it is called null hypothesis significance testing.

But what does very unlikely mean?

The idea of significance

In NHST, very unlikely is translated to statistically significantly different.

Also called: the alpha level.

e.g. an alpha level of \(\alpha = 0.01\) means that we deem a value unlikely (or statistically significantly different) if the probability of observing it is smaller than \(\alpha\).

Return of the z-score idea

Remember: we know this probability stuff and the idea of “unlikely”!

The alpha level corresponds exactly to regions on the distribution.

More specifically:

  • \(\alpha = 0.05\) is the area in the tail(s) where values lie that have a probability of less than 5%
  • \(\alpha = 0.01\) is the area in the tail(s) where values lie that have a probability of less than 1%
  • \(\alpha = 0.001\) is the area in the tail(s) where values lie that have a probability of less than 0.1%

Alpha and critical regions

Statistical significance

We can locate the z-scores that correspond to tail proportions (and hence: probabilities).

Important:

If we have a higher than or lower than \(H_1\), then we call this a directional hypothesis.

  • this “loads all unlikeliness” to one tail

Example:

\(\alpha = 0.05\) and a directional \(H_1\) needs a z-score that has a tail prob. of 0.05.

Statistical significance

Important:

If we have a different than \(H_1\), then we call this a non-directional hypothesis (i.e. we simply state that it is different than what we expect under the null but have no idea in which direction).

  • this means that we need to “spread all unlikeliness” to both tails

Example:

\(\alpha = 0.05\) and a non-directional \(H_1\) needs a z-score that has a tail prob. of 0.025 (because it spreads to both tails!).

Kinds of hypotheses

Directional alternative hypotheses:

  • we make a prediction about the direction of the difference (higher/lower than the null mean)
  • we use a corresponding one-tailed hypothesis test
  • all unlikeliness is in one tail

Non-directional alternative hypotheses:

  • we make no prediction about the direction of the difference but simply state that it is different from the null mean, i.e. higher or lower
  • we use a corresponding two-tailed hypothesis test
  • all unlikeliness is spread in both tails
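The corresponding critical z-values can be obtained from the inverse normal CDF (a sketch in Python with `statistics.NormalDist`):

```python
from statistics import NormalDist

alpha = 0.05

# one-tailed test (directional H1): all of alpha in one tail
z_one_tailed = NormalDist().inv_cdf(1 - alpha)

# two-tailed test (non-directional H1): alpha / 2 in each tail
z_two_tailed = NormalDist().inv_cdf(1 - alpha / 2)

print(round(z_one_tailed, 2))  # 1.64
print(round(z_two_tailed, 2))  # 1.96
```

The two-tailed critical value is larger: spreading the unlikeliness over both tails makes each tail harder to reach.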

Back to our example

We now test our hypothesis

  • We had a sample of \(n=49\) who got extra lessons
  • And we decide to deem values as unlikely that have a probability under the null of less than 1%
    • i.e. our alpha level is \(\alpha = 0.01\)

Since we have a directional \(H_1\) that states \(H_1: \mu > 6.9\), we load all unlikeliness to the right tail.

Now we gather the data

We have analysed the data of our \(n=49\) sample:

The sample mean is \(M=7.46\)

Significance testing

We obtain the z-score for the sample mean (see p. 210 in the book).

\(z=\frac{M-\mu}{\sigma_M} = \frac{7.46-6.90}{0.16} = \frac{0.56}{0.16} = 3.5\)

Thus:

  • the observed mean (with extra stats lessons) is 0.56 grades higher than what we would have expected under the null hypothesis
  • this difference corresponds to a z-score of \(z=3.5\)
  • i.e. the observed mean is 3.5 standard deviations above the null mean

Evaluating hypotheses

  • \(H_0: \mu = 6.9\)
  • \(H_1: \mu > 6.9\)
  • Observed: \(M=7.46\)
  • z-score of 3.5

Since it is a directional \(H_1\), we look at the tail prob. for \(z=3.5\).

z body tail M-to-z
3.50 .9998 .0002 .4998

Interpreting the findings

z body tail M-to-z
3.50 .9998 .0002 .4998

Observing a mean of \(M=7.46\) or higher has a probability of 0.0002 (or 0.02%) under the null hypothesis.

This is lower than our pre-defined threshold of \(\alpha = 0.01\):

We therefore reject the null hypothesis.

Our data support the alternative hypothesis that extra lessons did improve the grade.

The p-value

z body tail M-to-z
3.50 .9998 .0002 .4998

Observing a mean of \(M=7.46\) or higher has a probability of 0.0002 (or 0.02%) under the null hypothesis.

0.0002 is the p-value!

Written as \(p=.0002\)
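The whole test can be carried out in a few lines (a sketch in Python; computing the SE exactly rather than rounding it to 0.16 first gives a slightly larger z, but the conclusion is the same):

```python
from statistics import NormalDist

MU, SIGMA = 6.9, 1.1  # exam grades under the null: X ~ N(6.9, 1.1)
N = 49                # students with extra lessons
M = 7.46              # observed sample mean
ALPHA = 0.01          # pre-defined significance level

se = SIGMA / N ** 0.5        # 1.1 / 7 = 0.157...
z = (M - MU) / se            # about 3.56 (3.5 with the SE rounded to 0.16)
p = 1 - NormalDist().cdf(z)  # one-tailed p-value, since H1: mu > 6.9

print(round(p, 4))           # 0.0002
print("reject H0" if p < ALPHA else "fail to reject H0")
```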

Visually

PART 3: Errors in inference

Why errors?

Remember:

  • we are making inferences based on a sample
  • i.e. we have - by definition - limited information
  • so we might make incorrect inferences

Two kinds of errors: Type 1 errors and Type 2 errors

Type 1 errors

Analogy: false positives

We conclude there is a difference (an effect), but it’s a false alarm (in reality there is no effect).

In hypothesis terms: we reject the null but shouldn’t have done so.

Type 1 errors

We want to keep that error low.

i.e. we want to be quite sure that there is an effect.

This is all contained in the alpha level: under the null, a proportion of exactly \(\alpha\) lies in the critical region.

For \(\alpha=0.01\), 1% of the values under the null lie in that area.

Thus: when the null is true, we will incorrectly conclude that there is an effect in 1% of cases.

Type 2 errors

Analogy: missed effects.

We conclude that there is no difference, but in reality there is one (i.e. we miss the effect).

In hypothesis terms: we fail to reject the null hypothesis although we should have done so.

The probability of making this error is called \(\beta\).

More on this in the week on statistical power

In the live sessions

  • factors that are related to hypothesis testing and significance
  • step-by-step examples in hyp. testing

Recap

  • Distributions and samples
    • from population parameters to sample statistics
    • law of large numbers and the central limit theorem
    • building the distribution of sample means
  • Intro to hypothesis testing
    • two kinds of hypotheses
    • the null hypothesis of expected sample means
    • significance and unlikeliness
  • Errors in inference
    • Type 1 errors
    • Type 2 errors

Next week

  • the t-statistic